Towards Fully FP8 GEMM LLM Training at Scale

Introducing FOG Architectures for Efficient and Stable Low-Precision Transformer Training

Published

May 26, 2025

Authors: A. Hernández-Cano et al.
Link: http://arxiv.org/abs/2505.20524v1
Institutions: EPFL • ETHZ
Keywords: FP8, GEMM, Large Language Models, LLM training, Transformers, FP8DPA, FOG architecture, Delayed scaling, Kurtosis, Outlier mitigation, Deep learning efficiency, Language modeling, Benchmarking, FineWeb-Edu, BF16, Transformer Engine, Megatron-LM


Efficient training of large language models (LLMs) demands significant computational resources, prompting growing interest in lower-precision numerical formats such as FP8 to accelerate training and reduce memory and compute requirements. Adoption of FP8, however, has been hampered by training instabilities, largely caused by FP8's narrow dynamic range colliding with the outlier activations that arise in LLMs. Most existing FP8 training recipes therefore restrict FP8 to a subset of operations and fall back to higher-precision formats for the most sensitive matrix multiplications, particularly those inside the attention mechanism, which caps the throughput gains FP8 could otherwise deliver.
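The interaction between a narrow format and an outlier can be sketched in a few lines. The snippet below is an illustrative numpy toy, not the authors' code: `delayed_scale` and `cast_fp8` are hypothetical helpers mimicking the delayed-scaling recipe popularized by NVIDIA's Transformer Engine, where the FP8 scale for the current step is derived from the amax values of past iterations. When an activation spike exceeds that history, the stale scale pushes it past E4M3's maximum magnitude (448) and it saturates silently:

```python
import numpy as np

E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def delayed_scale(amax_history):
    """Delayed scaling: pick the FP8 scale from *past* iterations' amax,
    so the current tensor is cast before its own amax is known."""
    return E4M3_MAX / max(amax_history)

def cast_fp8(x, scale):
    # Values beyond the representable range saturate at +/- E4M3_MAX.
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX) / scale

history = [3.0, 3.2, 2.9]          # typical activation amax seen so far
x = np.array([0.5, -1.2, 40.0])    # outlier spike well beyond the history
s = delayed_scale(history)
x_q = cast_fp8(x, s)
print(x_q)  # the 40.0 entry is clipped to max(history) = 3.2, a large silent error
```

The well-scaled entries round-trip almost exactly, while the outlier loses over 90% of its magnitude, which is the kind of silent corruption that destabilizes training.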

To address these challenges, the authors introduce FOG, a family of Transformer architectures designed for stable, fully low-precision training: every GEMM, including the dot-product attention (FP8DPA), is computed in FP8. Rather than patching instabilities with numerical workarounds, the architectures are built to mitigate activation outliers at the source, with the kurtosis of activations tracked as an early signal of outlier growth during training.
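Kurtosis as an outlier monitor can be illustrated with a minimal sketch (the `kurtosis` helper below is my own, not the paper's implementation): for Gaussian activations the fourth standardized moment sits near 3, and a handful of outlier entries drives it up by orders of magnitude, long before anything overflows FP8.

```python
import numpy as np

def kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment of a tensor's entries. Gaussian
    activations give ~3; the value grows rapidly when a few outlier
    entries start to dominate the tensor."""
    x = x - x.mean()
    return float((x**4).mean() / (x**2).mean() ** 2)

rng = np.random.default_rng(0)
gauss = rng.normal(0.0, 1.0, 100_000)
spiky = gauss.copy()
spiky[:10] *= 100.0  # a handful of outlier activations

print(kurtosis(gauss))  # close to 3
print(kurtosis(spiky))  # far above 3: outliers flagged well before FP8 saturates
```

Tracking a scalar like this per layer is cheap, which is what makes it practical as a training-time health signal.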

With the approach in place, the evaluation benchmarks FOG models against BF16 baselines on FineWeb-Edu, using Megatron-LM with NVIDIA's Transformer Engine, comparing both training stability and throughput.

The study distills these results into its central takeaway: with an architecture that keeps activation outliers in check, fully FP8 GEMM training, attention included, is a viable path to efficient and stable LLM training at scale.